quantization bit
Hierarchical Channel-spatial Encoding for Communication-efficient Collaborative Learning
Collaborative learning (CL) systems often face a performance bottleneck from limited bandwidth, where multiple low-end devices continuously generate data and transmit intermediate features to the cloud for incremental training. Improving communication efficiency by reducing traffic size is therefore one of the most crucial issues for realistic deployment. Existing systems mostly compress features at the pixel level and ignore the characteristics of the feature structure, which could be further exploited for more efficient compression. In this paper, we offer new insights into implementing scalable CL systems through hierarchical compression of features, termed Stripe-wise Group Quantization (SGQ). Unlike previous unstructured quantization methods, SGQ captures both channel and spatial similarity in pixels and simultaneously encodes features at these two levels to gain a much higher compression ratio. In particular, we refactor the feature structure based on inter-channel similarity in the forward pass and bound the gradient deviation caused by quantization in the backward pass. This two-stage pipeline lets SGQ retain the same sublinear convergence order as vanilla SGD-based optimization. Extensive experiments show that SGQ achieves a higher traffic reduction ratio by up to 15.97× and provides a 9.22× image processing speedup over uniform quantized training, while preserving model accuracy comparable to FP32, even with 4-bit quantization. This verifies that SGQ can be applied to a wide spectrum of edge intelligence applications.
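The abstract above contrasts structured, group-aware quantization with uniform quantization of a whole feature tensor. The sketch below is not SGQ itself; it is a minimal illustration, under assumed toy values, of why giving each group of similar values its own quantization range can reduce error at the same bit width. The function names, the two-group split, and the feature values are all hypothetical.

```python
def quantize_uniform(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] over the values' own range."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp so that floating-point rounding can never leave the code range.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical features: two groups of channels with very different ranges.
group_a = [0.0, 0.1, 0.2, 0.3]
group_b = [10.0, 10.5, 11.0, 11.5]
features = group_a + group_b
bits = 4

# Baseline: one range for the whole tensor.
flat_codes, flat_lo, flat_scale = quantize_uniform(features, bits)
flat_err = max_abs_error(features, dequantize(flat_codes, flat_lo, flat_scale))

# Grouped: each group is quantized over its own (much narrower) range.
grouped_restored = []
for group in (group_a, group_b):
    codes, lo, scale = quantize_uniform(group, bits)
    grouped_restored.extend(dequantize(codes, lo, scale))
grouped_err = max_abs_error(features, grouped_restored)

# Payload shrinks by 32 / bits relative to FP32, ignoring per-group metadata.
payload_ratio = 32 / bits
```

In this toy setup both variants send 4-bit codes, so the payload ratio against FP32 is the same; the grouped variant only adds a `(lo, scale)` pair per group, which is the metadata cost a real scheme would need to account for.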
0e230b1a582d76526b7ad7fc62ae937d-AuthorFeedback.pdf
More extensive and thorough experiments are needed. Sub-1-bit quantization is only available through FleXOR. Or do some weights use >1b while others can use much less? The reviewer did not find results in the paper that used quantized inputs. "Input weight format" should read "Internal weight format."
Review for NeurIPS paper: HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks
Summary and Contributions: This paper suggests that the Hessian trace can be a good metric for automating the choice of the number of quantization bits for each layer, unlike previous attempts that used, for example, the top Hessian eigenvalue. Some mathematical analysis is provided to support the claim that the Hessian trace is better than the top Hessian eigenvalue, and memory footprint and model accuracy are compared on several models using the ImageNet database. This paper also shows that Hessian trace computations can be simplified by following Hutchinson's algorithm. Strengths: - Hessian-related metrics have been widely adopted to represent the different sensitivities of layers. This paper compares a few different Hessian-related approaches and provides some mathematical analysis to explain why the Hessian trace can be considered a good metric for choosing the number of quantization bits.
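The review above mentions Hutchinson's algorithm for estimating the Hessian trace. As a minimal sketch of that estimator (not the paper's implementation), the code below averages v^T H v over random Rademacher vectors v, using only matrix-vector products. The small explicit matrix `H` is a hypothetical stand-in for a layer's Hessian; in practice the product H v would come from automatic differentiation rather than a stored matrix.

```python
import random


def hutchinson_trace(matvec, dim, num_samples=2000, seed=0):
    """Estimate tr(H) as the average of v^T (H v) over Rademacher probes v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hv = matvec(v)
        total += sum(v_i * hv_i for v_i, hv_i in zip(v, hv))
    return total / num_samples


# Toy symmetric matrix standing in for a Hessian (hypothetical values).
H = [
    [4.0, 1.0, 0.0],
    [1.0, 3.0, 0.5],
    [0.0, 0.5, 2.0],
]


def matvec(v):
    """Return H @ v; the estimator never reads H's entries directly."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in H]


estimate = hutchinson_trace(matvec, dim=3)
exact = sum(H[i][i] for i in range(3))  # 9.0, available here only because H is explicit
```

The estimate is unbiased because E[v v^T] is the identity for Rademacher probes; its variance comes entirely from the off-diagonal entries of `H`, so more samples are needed when those are large.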
Deploying Large AI Models on Resource-Limited Devices with Split Federated Learning
Qiang, Xianke, Liu, Hongda, Zhang, Xinran, Chang, Zheng, Liang, Ying-Chang
Abstract--Large Artificial Intelligence Models (LAMs) are powered by massive datasets, extensive parameter scales, and extensive computational resources, and are driving significant transformations across various industries. Yet their practical deployment on resource-limited mobile edge devices is hindered by critical challenges such as data privacy, constrained resources, and high overhead costs. Addressing this gap, this paper proposes a novel framework, named Quantized Split Federated Fine-Tuning Large AI Model (SFLAM). By partitioning the training load between edge devices and servers using a split learning paradigm, SFLAM can facilitate the operation of large models on devices and significantly lower the memory requirements on edge devices. Additionally, SFLAM incorporates quantization management, power control, and bandwidth allocation strategies to enhance training efficiency while concurrently reducing energy consumption and communication latency. A theoretical analysis exploring the latency-energy trade-off is presented, and the framework's efficacy is validated via comprehensive simulations. The findings indicate that SFLAM achieves superior performance in terms of learning efficiency and scalability compared to conventional methods, thereby providing a valuable approach for enabling advanced AI services in resource-constrained scenarios.
I. Introduction
The advent of Large AI Models (LAMs), such as ChatGPT and DeepSeek, marked a significant leap in AI capabilities, powered by their extensive parameter scales, large-scale datasets, and substantial computational resources [1]. As user demand for ubiquitous AI access and real-time, personalized experiences grows, deploying and training these models on mobile devices becomes increasingly relevant [2].
To meet these escalating demands, fine-tuning, which involves adapting pre-trained models with domain-specific data, has become a widely adopted and efficient strategy for enhancing LAM performance on specialized tasks, offering a cost-effective path to superior results.
Robust Iterative Value Conversion: Deep Reinforcement Learning for Neurochip-driven Edge Robots
Kadokawa, Yuki, Kodera, Tomohito, Tsurumine, Yoshihisa, Nishimura, Shinya, Matsubara, Takamitsu
A neurochip is a device that reproduces the signal processing mechanisms of brain neurons and computes Spiking Neural Networks (SNNs) with low power consumption and at high speed. Neurochips are therefore attracting attention for edge robot applications, which suffer from limited battery capacity. This paper aims to achieve deep reinforcement learning (DRL) that acquires SNN policies suitable for neurochip implementation. Since DRL requires complex function approximation, we focus on conversion from a Floating Point NN (FPNN), because it is one of the most feasible SNN techniques. However, DRL requires a conversion to an SNN at every policy update, because each DRL learning cycle updates the FPNN policy and then collects learning samples with the converted SNN policy. Accumulated conversion errors can significantly degrade the performance of the SNN policies. We propose Robust Iterative Value Conversion (RIVC), a DRL framework that incorporates both conversion error reduction and robustness to conversion errors. To reduce conversion errors, the FPNN is optimized with the same number of quantization bits as the SNN, so that the FPNN output is not significantly changed by quantization. To add robustness to conversion errors, the quantized FPNN policy is updated to increase the gap between the probability of selecting the optimal action and the probabilities of the other actions. This step prevents unexpected replacements of the policy's optimal actions. We verified RIVC's effectiveness on a neurochip-driven robot. The results showed that RIVC consumed 1/15 of the power and ran five times faster than an edge CPU (quad-core ARM Cortex-A72). The previous framework with no countermeasures against conversion errors failed to train the policies. Videos from our experiments are available: https://youtu.be/Q5Z0-BvK1Tc.